PATTERN-MATCHING RULES FOR THE RECOGNITION OF
NATURAL LANGUAGE DIALOGUE EXPRESSIONS

Kenneth Mark Colby
and
Roger C. Parkison


INTRODUCTION

To recognize something is to identify it as an instance of the "same again". This familiarity is possible because of recurrent characteristics of the world. We shall describe an algorithm which recognizes recurrent characteristics of natural language dialogue expressions. It utilizes a multi-stage sequence of pattern-matching rules for progressively transforming an input expression until it eventually matches an abstract stored pattern. The stored pattern has a pointer to a response function in memory which decides what to do once the input has been recognized. Here we discuss only the recognizing functions, except for one response function (anaphoric substitution) which interactively aids the recognition process. Details of how the response functions operate will be described in a future communication by William Faught and ourselves.
We are constructing and testing a simulation of paranoid thought processes; our problem is to reproduce paranoid linguistic behavior in a teletyped diagnostic psychiatric interview. The diagnosis of paranoid states, reactions or modes is made by clinicians who judge the degree of correspondence between what they observe in an interview and their conceptual model of paranoid behavior. There exists a high degree of agreement among psychiatrists about this conceptual model, which relies mainly on what an interviewee says and how he says it.

Natural language is a life-expressing code which people use for communication with themselves and others. In a real-life dialogue such as a psychiatric interview, the participants have interests, intentions, and expectations which are revealed in their linguistic expressions. An interactive simulation of a paranoid patient must be able to demonstrate typical paranoid linguistic behavior. To achieve this effect, our paranoid model must have the ability to deal with the teletyped messages of an interviewer.
A number of approaches have been taken for dealing with natural language dialogue expressions (Winograd, 1972; Woods, 1970). These approaches rely on parsers which conduct a detailed syntactic and semantic analysis. They perform well for the purposes for which they were designed. Their weakness, for our purposes, lies in their lack of mechanisms for neglecting and ignoring. Such mechanisms are necessary in a program which accepts and responds to unrestricted conversational English characterized by expressions novel to the program.
How humans process natural language is largely unknown. They possess some knowledge of grammatical rules, but this fact does not entail that they use a grammar in interpreting and producing language. It seems implausible to us that people possess full transformational grammars for processing language. Language is what is recognized, but the processes involved may not be linguistic or grammatical. Originally, transformational grammars were not designed to "understand" a large subset of English; they constituted a formal method for deciding whether a string is grammatical.

An analysis of what one's problem actually is should guide the selection or invention of methods appropriate to its solution. Our problem is not to develop a consistent and general theory of language nor to assert empirically testable hypotheses about how people process language. Our problem is to design an algorithm which recognizes what is being said in a dialogue and what is being said about it in order to make a response such that a sample of I-O pairs from the paranoid model is judged similar to a sample of I-O pairs from paranoid patients. The design task belongs to artificial intelligence, in which the criterion is whether the computer program performs mind-like functions. New methods had to be devised for an algorithm to participate in a human dialogue in a paranoid-patient-like way. We sought effective methods which could operate efficiently in real time. Since our method provides a general way of many-to-one mapping from surface expressions to a single stored pattern, it is not limited to the simulation of paranoia, but can be used by any type of "host" system which takes natural language as input.
Our method is to transform the input until a pattern is obtained which matches, completely or partially, a more abstract stored pattern. This strategy has proved adequate for our purposes a satisfactory percentage of the time. The power of this method for natural language dialogues lies in its ability to ignore as irrelevant some of what it recognizes and everything it does not recognize at all. A linguistic parser doing word-by-word, parts-of-speech analysis fails when it cannot find one or more of the input words in its dictionary. A system that must know every word is too fragile for unrestricted dialogues.
In early versions of the paranoid model, such as PARRY1, some of the pattern recognition mechanisms allowed the elements of the pattern to be order independent (Colby, Weber, and Hilf, 1971). For example, consider the following expressions:

    (1) WHERE DO YOU WORK?
    (2) WHAT SORT OF WORK DO YOU DO?
    (3) WHAT IS YOUR OCCUPATION?
    (4) WHAT DO YOU DO FOR A LIVING?
    (5) WHERE ARE YOU EMPLOYED?

In PARRY1 a procedure scans these expressions looking for an information-bearing contentive such as "work", "for a living", etc. When it finds such a contentive along with "you" or "your" in the expression, regardless of word order, it responds to the expression as if it were a question about the nature of one's work. This method correctly classifies the five sentences above. Unfortunately, it includes the two examples below in the same category:

    (6) DOES YOUR FATHER'S CAR WORK?
    (7) HOW DID THINGS WORK OUT FOR YOU?
An insensitivity to word order has the advantage that lexical items representing different parts of speech can represent the same concept, e.g. the word "work" represents the same concept whether it is used as a noun or a verb. But a price is paid for this resilience and elasticity. We find from experience that, since English relies heavily on word order to convey the meaning of its messages, the average penalty of misunderstanding (to be distinguished from ununderstanding) is too great. Hence in PARRY2, as will be described shortly, all the patterns require a specified word order.
For high-complexity problems it is helpful to have constraints. Diagnostic psychiatric interviews (and especially those conducted over teletypes) have several natural constraints. First, clinicians are trained to ask certain questions in certain ways. This limits the number of patterns required to recognize utterances about each topic. Second, only a few hundred standard topics are brought up by interviewers, who are, furthermore, trained to use everyday expressions and especially those used by the patient himself. When the interview is conducted over teletypes, expressions tend to be shortened since the interviewer tries to increase the information transmission rate over the slow channel of a teletype. Finally, teletyped interviews represent written utterances, and utterances are known to be highly redundant, such that unrecognized words can be ignored without losing the meaning of the message. Also, utterances are loaded with idioms, cliches, pat phrases, etc. - all easy prey for a pattern-matching approach. It is time-wasting and usually futile to try to decode an idiom by analyzing the meanings of its individual words.

We now describe the pattern-matching functions of the algorithm in some detail. (See Fig. 1 for a diagram of the overall flow of control.)


OVERVIEW

PARRY2 has two primary modules. The first attempts to RECOGNIZE the input and the second RESPONDS. This paper is primarily about the RECOGNIZE module. It functions independently of the RESPOND module except in the case of pronoun references, which the RESPOND module provides to the RECOGNIZER on request.

The recognition module has four main steps:

    1) Identify the words in the question and convert them to internal synonyms.
    2) Break the input into segments at certain bracketing words.
    3) Match each segment (independently) to a stored pattern.
    4) Match the resulting list of recognized segments to a stored compound pattern.

Each of these steps, except the segmenting, throws away what it cannot identify. Occasionally a reference to an unknown topic is mis-recognized as some familiar topic.
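
A minimal, self-contained sketch of this four-step flow (rendered here in Python purely for concreteness; the dictionary, bracketing terms, and pattern tables are tiny illustrative stand-ins, not PARRY2's actual data, and the details of each step are given in the sections below):

DICTIONARY = {
    "what": "WHAT", "is": "BE", "are": "BE", "your": "YOU",
    "you": "YOU", "occupation": "JOB", "why": "WHY",
    "in": "IN", "hospital": "HOSPITAL",
}
BRACKET_TERMS = {"WHAT", "WHY", "IN"}          # wh-forms, prepositions, ...
SIMPLE_PATTERNS = {
    ("WHAT", "BE", "YOU", "JOB"): "job-question",
    ("WHY", "BE", "YOU"): "why-be-you",
    ("IN", "HOSPITAL"): "in-hospital",
}
COMPOUND_PATTERNS = {("why-be-you", "in-hospital"): "why-hospitalized"}

def recognize(line):
    # Step 1: translate known words to internal synonyms; drop the rest.
    words = [DICTIONARY[w] for w in line.lower().rstrip("?.!").split()
             if w in DICTIONARY]
    # Step 2: break the word list into segments at bracketing terms.
    segments = [[]]
    for w in words:
        if w in BRACKET_TERMS and segments[-1]:
            segments.append([])
        segments[-1].append(w)
    # Step 3: match each segment independently; ignore what fails.
    names = [SIMPLE_PATTERNS.get(tuple(s)) for s in segments]
    names = [n for n in names if n is not None]
    # Step 4: match the list of recognized segments as a whole.
    if len(names) > 1:
        return COMPOUND_PATTERNS.get(tuple(names))
    return names[0] if names else None

print(recognize("WHAT IS YOUR CURRENT OCCUPATION?"))  # -> job-question
print(recognize("WHY ARE YOU IN HOSPITAL?"))          # -> why-hospitalized

Note that "current" in the first example is simply dropped: anything the dictionary does not know never reaches the later stages.
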

PREPROCESSING

Each word in the input expression is first looked up in a dictionary of (currently) about 1900 entries which, for the sake of speed, is maintained in core during run-time. Illustrative examples of dictionary entries are given in Appendix 2. The dictionary, which was built empirically from thousands of teletyped interviews with previous versions of the model, consists of words, groups of words, and names of word-classes they can be translated into. The size of the dictionary is determined by the patterns, i.e. a word is not included unless it appears in one or more patterns. Entries in the dictionary reflect PARRY2's main interests. If a word in the input is not in the dictionary, it is dropped from the pattern being formed. Thus if the input is:

    WHAT IS YOUR CURRENT OCCUPATION?

and the word "current" is not in the dictionary, the pattern at this stage becomes:

    ( WHAT IS YOUR OCCUPATION )

The question-mark is thrown away as redundant since questions are recognized by word order. (A statement followed by a question mark (YOU GAMBLE?) is responded to in the same way as that statement followed by a period.) Synonymic translations of words are made so that the pattern becomes, for example:

    ( WHAT BE YOU JOB )

Some groups of words (i.e. idioms) are translated as a group so that, for example, "for a living" becomes "for job". Certain other juxtaposed words are contracted into a single word, e.g. "place of birth" becomes "birthplace". This is done to deal with groups of words which are represented as a single element in the stored pattern, thereby preventing segmentation from occurring at the wrong places, such as at a preposition inside an idiom or phrase. Besides these contractions, certain expansions are made so that, for example, "DON'T" becomes "DO NOT" and "I'D" becomes "I WOULD".
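
A sketch of this preprocessing stage, under the same caveats as before: the EXPANSIONS, GROUPS, and WORD_CLASSES tables hold only the examples just mentioned, standing in for the real 1900-entry dictionary.

EXPANSIONS = {"don't": "do not", "i'd": "i would"}
GROUPS = {("for", "a", "living"): ["for", "job"],
          ("place", "of", "birth"): ["birthplace"]}
WORD_CLASSES = {"what": "WHAT", "is": "BE", "do": "DO", "your": "YOU",
                "you": "YOU", "occupation": "JOB", "work": "JOB",
                "for": "FOR", "job": "JOB", "birthplace": "BIRTHPLACE"}

def preprocess(line):
    words = line.lower().replace("?", "").split()
    # Expand contractions so later stages see the full forms.
    expanded = []
    for w in words:
        expanded.extend(EXPANSIONS.get(w, w).split())
    # Translate idioms as a group and contract multi-word names into a
    # single element, trying the longest group first.
    out, i = [], 0
    while i < len(expanded):
        for length in (3, 2):
            group = tuple(expanded[i:i + length])
            if group in GROUPS:
                out.extend(GROUPS[group])
                i += length
                break
        else:
            out.append(expanded[i])
            i += 1
    # Translate each word to its word-class; drop unknown words.
    return [WORD_CLASSES[w] for w in out if w in WORD_CLASSES]

print(preprocess("WHAT IS YOUR CURRENT OCCUPATION?"))
# -> ['WHAT', 'BE', 'YOU', 'JOB']   ("current" is unknown and dropped)
print(preprocess("WHAT DO YOU DO FOR A LIVING?"))
# -> ['WHAT', 'DO', 'YOU', 'DO', 'FOR', 'JOB']
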
Misspellings can be the bane of teletyped interviews for an algorithm. Here they are handled in two ways. First, common misspellings of important words are simply put in the dictionary. Thus "yoy" is known to mean "you". The apostrophe is often omitted from contractions, so most contractions are recognized with or without it. These common misspellings were gathered from over 4000 interviews with earlier versions of the paranoid model.

Second, five common forms of typing error are checked systematically. These are:

    1) Doubled letter
    2) Extraneous letter
    3) Forgetting to hold the "shift key" for an apostrophe
    4) Hitting a nearby key on the keyboard
    5) Transposing two letters in a word

The first three errors can be corrected by deleting the offending character from the word. This is accomplished by deleting each character in turn until the word is recognized. The fourth type of error is only checked for 10 of the more common near misses. These were also empirically determined and include letter pairs like (T Y), (O P), (A S), and (N M). These methods are all based on typing errors, but they also correct some legitimate English spelling errors. Two-letter transposition corrects, for example, "beleive" to "believe".
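
The five checks can be sketched as candidate-generating transformations tried until a known word results. Here KNOWN is a stand-in for the dictionary, and only the four near-miss pairs named above are included (the real table has 10).

KNOWN = {"you", "believe"}                     # stand-in for the dictionary
NEAR_MISSES = [("t", "y"), ("o", "p"), ("a", "s"), ("n", "m")]

def correct(word):
    if word in KNOWN:
        return word
    # Checks 1-3: delete each character in turn.  This covers doubled
    # letters, extraneous letters, and the stray character left when the
    # shift key was not held for an apostrophe.
    for i in range(len(word)):
        candidate = word[:i] + word[i + 1:]
        if candidate in KNOWN:
            return candidate
    # Check 4: replace a character by a common keyboard near miss.
    for i, ch in enumerate(word):
        for a, b in NEAR_MISSES:
            swap = b if ch == a else a if ch == b else None
            if swap is not None:
                candidate = word[:i] + swap + word[i + 1:]
                if candidate in KNOWN:
                    return candidate
    # Check 5: transpose each adjacent pair of letters.
    for i in range(len(word) - 1):
        candidate = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        if candidate in KNOWN:
            return candidate
    return None                                # still unrecognized

print(correct("yoou"))     # doubled letter     -> you
print(correct("tou"))      # nearby key (t/y)   -> you
print(correct("beleive"))  # transposition      -> believe
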

SEGMENTING

Another weakness in the crude pattern matching of PARRY1 is that it takes the entire input expression as its basic processing unit. If only two words are recognized in an eight-word utterance, the risk of misunderstanding is great. We need a way of dealing with units shorter than the entire input expression.

Aided by a heuristic from work in machine translation (Wilks, 1973), we devised a way of bracketing the pattern constructed up to this point into shorter segments, using prepositions, wh-forms, certain verbs, etc. as bracketing points. (A list of the bracketing terms appears in Fig. 2.) These points tend to separate prepositional phrases and embedded clauses from the main clause. The new pattern formed is termed either "simple", having no delimiters within it, or "compound", i.e., being made up of two or more simple patterns. A simple pattern might be:

    ( WHAT BE YOU JOB )

whereas a compound pattern would be:

    (( WHY BE YOU ) ( IN HOSPITAL ))

Our experience with this method of segmentation shows that compound patterns from teletyped psychiatric dialogues rarely consist of more than three or four segments.

After certain verbs (see Fig. 4) a bracketing occurs to replace the commonly omitted "THAT", such that:

    ( I THINK YOU BE AFRAID )

becomes

    (( I THINK ) ( YOU BE AFRAID ))
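
A sketch of the segmenter: BRACKET is an illustrative subset of the Fig. 2 terms, and THAT_VERBS stands in for the Fig. 4 verb list.

BRACKET = {"WHAT", "WHY", "WHERE", "HOW", "IN", "FOR", "TO"}
THAT_VERBS = {"THINK"}      # bracket after these verbs (full list in Fig. 4)

def segment(words):
    segments = [[]]
    for i, w in enumerate(words):
        if w in BRACKET and segments[-1]:
            segments.append([])               # break before a bracketing term
        segments[-1].append(w)
        if w in THAT_VERBS and i + 1 < len(words):
            segments.append([])               # break where "THAT" was omitted
    return [s for s in segments if s]

print(segment("WHY BE YOU IN HOSPITAL".split()))
# -> [['WHY', 'BE', 'YOU'], ['IN', 'HOSPITAL']]
print(segment("I THINK YOU BE AFRAID".split()))
# -> [['I', 'THINK'], ['YOU', 'BE', 'AFRAID']]
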

MATCHING INDIVIDUAL SEGMENTS

Conjunctions serve only as markers for the segmenter and they are dropped out after segmentation.

Negations are handled by extracting the "NOT" from the segment and assigning a value to a global variable which indicates that the expression is negative in form. When a pattern is finally matched, this variable is consulted. Some patterns have a pointer to a pattern of opposite meaning if a "NOT" could reverse their meanings. If this pointer is present and a "NOT" was found, then the pattern matched is replaced by its opposite, e.g. ( I NOT TRUST YOU ) is replaced by the pattern ( I MISTRUST YOU ). We have not yet observed the troublesome case of "he gave me not one but two messages". (There is no need to scratch where it doesn't itch.)
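
A sketch of the NOT handling; for simplicity the negation indicator here is a local flag rather than a global variable, and the opposite-meaning pointer is a small illustrative table.

PATTERNS = {("I", "TRUST", "YOU"), ("I", "MISTRUST", "YOU")}
OPPOSITES = {("I", "TRUST", "YOU"): ("I", "MISTRUST", "YOU")}

def match_with_negation(segment):
    negative = "NOT" in segment                # set the negation indicator
    stripped = tuple(w for w in segment if w != "NOT")
    if stripped in PATTERNS:
        if negative and stripped in OPPOSITES:
            return OPPOSITES[stripped]         # meaning reversed by "NOT"
        return stripped
    return None

print(match_with_negation(("I", "NOT", "TRUST", "YOU")))
# -> ('I', 'MISTRUST', 'YOU')
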
Substitutions are also made in certain cases. Some segments contain pronouns which could stand for a number of different things of importance to PARRY2. As we mentioned in the introduction, the response functions of memory keep track of the context in order to give pronouns and other anaphors a correct interpretation. For example, the segment:

    ( DO YOU AVOID THEM )

could refer to the Mafia, or racetracks, or other patients, depending on the context. When such a segment is encountered, the pronoun is replaced by its current anaphoric value as determined by the response functions, and a more specific segment such as:

    ( DO YOU AVOID MAFIA )

is looked up.
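
A sketch of this substitution step, assuming the current anaphoric values arrive as a table supplied by the response functions:

PATTERNS = {("DO", "YOU", "AVOID", "MAFIA")}
ANAPHORIC_VALUES = {"THEM": "MAFIA"}          # supplied by response functions

def resolve(segment):
    # Replace each pronoun by its current anaphoric value, then look up
    # the more specific segment.
    resolved = tuple(ANAPHORIC_VALUES.get(w, w) for w in segment)
    return resolved if resolved in PATTERNS else None

print(resolve(("DO", "YOU", "AVOID", "THEM")))
# -> ('DO', 'YOU', 'AVOID', 'MAFIA')
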
Other utterances, such as "Why did you do that?" or just "Why?" (which might be regarded as a massive ellipsis), clearly refer back to previous utterances. These utterances match very general patterns which identify the type of question without indicating the exact topic. The response function which responds to "Why?" consults the context to produce an appropriate answer.

The algorithm next attempts to match the segments with stored simple patterns, which currently number about 1700. (Examples of simple patterns appear in Appendix 3.) First a complete and perfect match is sought. When a match is found, the stored pattern name has a pointer to the name of a response function in memory which decides what to do further. If a match is not found, further transformations of the segment are carried out and a "fuzzy" match is tried.

For fuzzy matching at this stage, we adopted the heuristic rule of dropping elements in the segment one at a time and attempting a match each time. This heuristic allows ignoring familiar words in unfamiliar contexts. For example, "well" is important in "Are you well?" but meaningless in "Well are you?".

Deleting one element at a time results in, for example, the pattern:

    ( WHAT BE YOU MAIN PROBLEM )

becoming successively:

    (a) ( BE YOU MAIN PROBLEM )
    (b) ( WHAT YOU MAIN PROBLEM )
    (c) ( WHAT BE MAIN PROBLEM )
    (d) ( WHAT BE YOU PROBLEM )
    (e) ( WHAT BE YOU MAIN )

Since the stored pattern in this case matches (d), (e) would not be constructed. We found it unwise to delete more than one element, since our segmentation method usually yields segments containing a small number (1-4) of words.

Dropping an element at a time provides a probability threshold for fuzzy matching which is a function of the length of the segment. If a segment consists of five elements, four of the five must be present in a particular order (with the fifth element missing in any position) for a match to occur. If a segment contains four elements, three must match - and so forth.
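
The drop-one-element heuristic in sketch form; note that generation stops at the first variant that matches, so (e) above is never built.

PATTERNS = {("WHAT", "BE", "YOU", "PROBLEM")}

def fuzzy_match(segment):
    if segment in PATTERNS:
        return segment                         # complete and perfect match
    # Drop one element at a time, left to right, until a variant matches.
    for i in range(len(segment)):
        variant = segment[:i] + segment[i + 1:]
        if variant in PATTERNS:
            return variant                     # one element ignored
    return None

print(fuzzy_match(("WHAT", "BE", "YOU", "MAIN", "PROBLEM")))
# -> ('WHAT', 'BE', 'YOU', 'PROBLEM')   variants (a)-(c) fail, (d) matches

Because only one element may be missing, the implied match threshold scales with segment length exactly as described above.
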

COMPOUND-PATTERN MATCH

When more than one simple pattern is detected in the input, a second matching is attempted against about 500 compound patterns. Certain patterns, such as ( HELLO ) and ( I THINK ), are dropped because they are considered meaningless. If a complete match is not found, then simple patterns are dropped, one at a time, from the compound pattern. This allows the input:

    (( HOW DO YOU COME ) ( TO BE ) ( IN HOSPITAL ))

to match the stored pattern:

    (( HOW DO YOU COME ) ( IN HOSPITAL ))
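
A sketch of the compound match, with recognized segments represented by their simple-pattern names and an illustrative compound table:

MEANINGLESS = {"HELLO", "I THINK"}
COMPOUNDS = {("HOW DO YOU COME", "IN HOSPITAL"): "why-hospitalized"}

def match_compound(names):
    names = tuple(n for n in names if n not in MEANINGLESS)
    if names in COMPOUNDS:
        return COMPOUNDS[names]
    # Drop one simple pattern at a time from the compound pattern.
    for i in range(len(names)):
        variant = names[:i] + names[i + 1:]
        if variant in COMPOUNDS:
            return COMPOUNDS[variant]
    return None                                # default condition

print(match_compound(("HOW DO YOU COME", "TO BE", "IN HOSPITAL")))
# -> why-hospitalized
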
If no match can be found at this point, the algorithm has arrived at a default condition and the appropriate response functions decide what to do. For example, in a default condition, the model may assume control of the interview, asking the interviewer a question, continuing with the topic under discussion, or introducing a new topic.

An annotated example of a diagnostic psychiatric interview is presented in Appendix 1.


ADVANTAGES AND LIMITATIONS

As mentioned, one of the main advantages of a pattern-matching strategy is that it can ignore as irrelevant both some of what it recognizes and what it does not recognize at all. There are several million words in English, each possessing from one to over a hundred senses. To construct a machine-usable word dictionary of this magnitude is out of the question at this time. Recognition of natural language input in the manner described above allows real-time interaction in a dialogue since it avoids becoming ensnarled in combinatorial disambiguations and long chains of inferencing which would slow a dialogue algorithm down to impracticality, if it could even function at all. The price paid for pattern-matching is that sometimes, but rarely, ambiguities slip through.
A drawback to PARRY1 is that it reacts to the first pattern it finds in the input rather than characterizing the input as fully as possible and then deciding what to do based on a number of tests. Another practical difficulty with PARRY1, from a programmer's viewpoint, is that elements of the patterns are strung out in various procedures throughout the algorithm. It is often a considerable chore for the programmer to determine whether a given pattern is present and precisely where it is. In PARRY2 the patterns are all collected in one part of the data-base where they can easily be examined.
Concentrating all the patterns in the data-base gives PARRY2 a limited "learning" ability. When an input fails to match any stored pattern or matches an incorrect one, as judged by a human operator, a pattern which matches the input can be put into the data-base automatically. If the new pattern has the same meaning as a previously stored pattern, the human operator must provide the name of the appropriate response function. If he doesn't remember the name, he may try to rephrase the input in a form recognizable to PARRY2 and it will name the response function associated with the rephrasing. These mechanisms are not "learning" in the commonly used sense, but they do allow a person to transfer his knowledge into PARRY2's data-base with very little effort.
Informal observation thus far shows PARRY2's linguistic recognition abilities to be quite superior to PARRY1's. A more systematic and quantitative evaluation of performance is now being carried out. PARRY1 was extensively tested by having judges make ratings of its performance along several dimensions, one of which was linguistic noncomprehension (Colby and Hilf, 1974). These judges also made ratings of teletyped interviews with psychiatric patients and with a random version of PARRY1. Once the ratings of PARRY2 along this linguistic dimension are completed, we will be able to compare them with the ratings of PARRY1 and measure the improvement of the model along this dimension.


REFERENCES

Colby, K.M., Weber, S., and Hilf, F.D. (1971). Artificial Paranoia. ARTIFICIAL INTELLIGENCE, 2, 1-25.

Colby, K.M. and Hilf, F.D. (1974). Multidimensional Evaluation of a Computer Simulation of Paranoid Thought. To appear in KNOWLEDGE AND COGNITION, L. Gregg (Ed.).

Wilks, Y. (1973). An Artificial Intelligence Approach to Machine Translation. In COMPUTER MODELS OF THOUGHT AND LANGUAGE, R.C. Schank and K.M. Colby (Eds.), W.H. Freeman, San Francisco.

Winograd, T. (1972). UNDERSTANDING NATURAL LANGUAGE. Academic Press, New York.

Woods, W.A. (1970). Transition Network Grammars for Natural Language Analysis. COMMUNICATIONS OF THE ACM, 13, 591-606.